Preparing for Your Professional Machine Learning Engineer Journey

Google Cloud PMLE (Professional Machine Learning Engineer) Certification Exam

Sample Question 1 / Question ID: 66

Question

You work for a retail company. You have a managed tabular dataset in Vertex AI that contains sales data from three different stores. The dataset includes several features, such as store name and sale timestamp. You want to use the data to train a model that makes sales predictions for a new store that will open soon. You need to split the data between the training, validation, and test sets. What approach should you use to split the data?

A: Use Vertex AI manual split, using the store name feature to assign one store to each set

B: Use Vertex AI default data split

C: Use Vertex AI chronological split, and specify the sales timestamp feature as the time variable

D: Use Vertex AI random split, assigning 70% of the rows to the training set, 10% to the validation set, and 20% to the test set


Analysis

A. Wrong. A manual split that assigns one store to each set trains the model on a single store's data and validates and tests it on the other two, so the model never sees how sales vary between stores. It also ignores the time-series nature of sales forecasting: each set can cover the same time period, so partitioning by location rather than by time gives you no way to evaluate how well the model handles future trends.
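
To make option A concrete, here is a minimal sketch in plain Python of what a store-based manual split does. The row layout, the store names, and the split labels are illustrative assumptions, not values taken from the dataset or from the Vertex AI API.

```python
from collections import Counter

# Hypothetical sales rows: (store_name, sale_date, amount).
rows = [
    ("store_a", "2023-01-05", 120.0),
    ("store_b", "2023-01-06", 95.0),
    ("store_c", "2023-01-07", 130.0),
    ("store_a", "2023-02-01", 110.0),
    ("store_b", "2023-02-02", 105.0),
    ("store_c", "2023-02-03", 125.0),
]

# Option A: one store per set (the label strings are illustrative).
store_to_split = {
    "store_a": "training",
    "store_b": "validation",
    "store_c": "test",
}


def assign_split_by_store(rows, store_to_split):
    """Return (row, split_label) pairs, choosing the label from the store name."""
    return [(row, store_to_split[row[0]]) for row in rows]


labeled = assign_split_by_store(rows, store_to_split)
split_sizes = Counter(label for _, label in labeled)
training_stores = {row[0] for row, label in labeled if label == "training"}

print(split_sizes)      # number of rows that landed in each set
print(training_stores)  # the stores visible to the model during training
```

In this toy data the training set contains rows from one store only, and every set covers the same January to February period, which is the weakness the analysis of option A points out.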

B. Wrong. Vertex AI's default split for tabular datasets is a random split (80% training, 10% validation, 10% test). A random split is a poor fit for time-dependent data such as sales transactions because it introduces temporal leakage: rows dated later than some validation and test rows end up in the training set, so the model is evaluated on periods it has effectively already seen.

C. Correct. Sales data is inherently time-dependent. When you train a model to predict future sales, including sales for a store that has not opened yet, split the data chronologically on a timestamp feature. The model then trains on the earliest data, validates on later data, and tests on the most recent data, which simulates how it will be used in production to forecast the future.
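
The idea behind a chronological split can be sketched in a few lines of plain Python. This is only an illustration of the ordering logic, not Vertex AI's implementation: Vertex AI performs the split itself once the time column is specified. The function name, the row layout, and the 80/10/10 fractions are assumptions chosen for the example.

```python
from datetime import date


def chronological_split(rows, time_key, train_frac=0.8, validation_frac=0.1):
    """Sort rows by time, then cut them into training, validation, and test.

    The earliest rows go to training, the next slice to validation, and the
    most recent rows to test.
    """
    ordered = sorted(rows, key=time_key)
    n = len(ordered)
    train_end = round(n * train_frac)
    validation_end = train_end + round(n * validation_frac)
    return (
        ordered[:train_end],
        ordered[train_end:validation_end],
        ordered[validation_end:],
    )


# Hypothetical rows, deliberately out of date order.
rows = [
    {"store": "store_a", "sale_date": date(2023, 1, day), "amount": 100.0 + day}
    for day in (7, 2, 9, 4, 1, 10, 3, 8, 5, 6)
]

train, validation, test = chronological_split(
    rows, time_key=lambda r: r["sale_date"]
)

latest_train = max(r["sale_date"] for r in train)
earliest_validation = min(r["sale_date"] for r in validation)
earliest_test = min(r["sale_date"] for r in test)

print(len(train), len(validation), len(test))
print(latest_train < earliest_validation < earliest_test)
```

Every training row predates every validation row, and every validation row predates every test row, so evaluation always happens on data the model could not have seen. When the training job is launched through the Vertex AI SDK for Python, the time column is passed by name (the `timestamp_split_column_name` argument of `AutoMLTabularTrainingJob.run`, to the best of current knowledge; check it against the SDK version in use).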

D. Wrong. Like the default split, this option assigns rows to the training, validation, and test sets at random; only the percentages change. With timestamped sales data, this leaks information from the "future" into the training set, which produces overly optimistic evaluation metrics that will not hold up in production.
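
The leakage described for options B and D can be checked with a small experiment. The sketch below is a plain-Python illustration under stated assumptions: the data is a hypothetical series of 100 daily sale dates, and `random_split` and `training_overlaps_test_period` are helper names invented for this example.

```python
import random
from datetime import date, timedelta


def random_split(rows, train_frac, validation_frac, seed=0):
    """Shuffle rows and cut them by fraction, ignoring any time order."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    train_end = round(n * train_frac)
    validation_end = train_end + round(n * validation_frac)
    return (
        shuffled[:train_end],
        shuffled[train_end:validation_end],
        shuffled[validation_end:],
    )


def training_overlaps_test_period(train, test):
    """Return True if any training row is dated after the earliest test row."""
    return max(train) > min(test)


# Hypothetical data: one sale date per day for 100 days.
rows = [date(2023, 1, 1) + timedelta(days=i) for i in range(100)]

# Option D: 70% training, 10% validation, 20% test, assigned at random.
train, validation, test = random_split(rows, train_frac=0.7, validation_frac=0.1)
print(len(train), len(validation), len(test))

# For practically any shuffle, training dates fall inside the test period.
print(training_overlaps_test_period(train, test))

# Contrast: a time-ordered cut keeps training strictly before test.
print(training_overlaps_test_period(rows[:80], rows[90:]))
```

The random split mixes training dates into the period covered by the test set, while the time-ordered cut does not. That overlap is what inflates the evaluation metrics.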